Automatic Acquisition of Class-based Rules for Word Alignment

Authors

  • Sur-Jin Ker
  • Jason J. S. Chang
Abstract

In this paper, we describe an algorithm for aligning words with their translations in a bilingual corpus. Existing algorithms require enormous amounts of bilingual data to train statistical word-to-word translation models. Using a word-based approach, frequent words with consistent translations can be aligned at a high precision rate. However, less frequent words, or words with diverse translations, usually lack statistically significant evidence for confident alignment, and incomplete or incorrect alignments result. Our algorithm attempts to handle this problem using a hierarchical class-based approximation of translation probabilities. The translation probabilities are estimated using class-based models at 3 levels of specificity. We found that the algorithm can provide translation probabilities for more word pairs at the cost of a slightly lower degree of precision, even when a small corpus is used in training. We have achieved an application rate of 81.8% and a precision rate of 93.3%. The algorithm also offers the advantage of producing word-sense disambiguation information.
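
The idea of falling back from sparse word-level evidence to class-level estimates can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say what the 3 levels are, so the levels used here (word pair, fine class pair, coarse class pair), the class codes, the toy word pairs, the `min_count` threshold, and the function names `build_level` and `backoff_prob` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical class codes: "J.104" is a fine class, its prefix "J" a coarse class.
SRC_CLASS = {"bank": "J.104", "loan": "J.104", "credit": "J.105"}
TGT_CLASS = {"T_bank": "D.21", "T_loan": "D.21", "T_credit": "D.22", "T_shore": "L.03"}

# Toy aligned (source word, target word) pairs; not data from the paper.
PAIRS = ([("bank", "T_bank")] * 3 + [("bank", "T_shore")]
         + [("loan", "T_loan")] * 2 + [("credit", "T_credit")])

def build_level(pairs, src_key, tgt_key):
    """Count co-occurrences after mapping each word to its key at this level."""
    joint, total = Counter(), Counter()
    for s, t in pairs:
        joint[(src_key(s), tgt_key(t))] += 1
        total[src_key(s)] += 1
    return {"src_key": src_key, "tgt_key": tgt_key, "joint": joint, "total": total}

def backoff_prob(s, t, levels, min_count=2):
    """Use the most specific level whose joint count reaches min_count.

    Returns (level index, relative frequency of the target key given the
    source key at that level), or (None, 0.0) if no level has enough evidence.
    """
    for i, lv in enumerate(levels):
        ks, kt = lv["src_key"](s), lv["tgt_key"](t)
        if lv["joint"][(ks, kt)] >= min_count:
            return i, lv["joint"][(ks, kt)] / lv["total"][ks]
    return None, 0.0

levels = [
    # Level 0: word to word (most specific).
    build_level(PAIRS, lambda w: w, lambda w: w),
    # Level 1: fine class to fine class.
    build_level(PAIRS, lambda w: SRC_CLASS[w], lambda w: TGT_CLASS[w]),
    # Level 2: coarse class to coarse class (least specific).
    build_level(PAIRS, lambda w: SRC_CLASS[w].split(".")[0],
                lambda w: TGT_CLASS[w].split(".")[0]),
]

for pair in [("bank", "T_bank"), ("bank", "T_loan"), ("credit", "T_credit")]:
    print(pair, backoff_prob(*pair, levels))
```

In this toy setup the frequent pair ("bank", "T_bank") is scored at the word level, while the rare pair ("credit", "T_credit") only receives an estimate at the coarse class level. That mirrors the trade-off the abstract describes: more word pairs get a score, but the score is less specific. Note that a class-level value here is a class-to-class relative frequency, not a word-to-word probability.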

Related resources

Aligning More Words with High Precision for Small Bilingual Corpora

In this paper, we propose an algorithm for identifying each word with its translations in a sentence and translation pair. Previously proposed methods require enormous amounts of bilingual data to train statistical word-by-word translation models. By taking a word-based approach, these methods align frequent words with consistent translations at a high precision rate. However, less frequent wor...

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

Enriching the "Senso Comune" Platform with Automatically Acquired Data

This paper reports on research activities on automatic methods for the enrichment of the Senso Comune platform. At this stage of development, we will report on two tasks, namely word sense alignment with MultiWordNet and automatic acquisition of Verb Shallow Frames from sense annotated data in the MultiSemCor corpus. The results obtained are satisfying. We achieved a final F-measure of 0.64 for...

Automatic Phrase Alignment: Using Statistical N-gram Alignment for Syntactic Phrase Alignment

A parallel treebank consists of syntactically annotated sentences in two or more languages, taken from translated (i.e. parallel) documents. These parallel sentences are linked through alignment. Much work has been done on sentence and word alignment, but not as much on the intermediate level. This paper explores using n-gram alignment created for statistical machine translation based on GIZA++...

Variation Sets Facilitate Artificial Language Learning

Variation set structure — partial alignment of successive utterances in child-directed speech — has been shown to correlate with progress in the acquisition of syntax by children. The present study demonstrates that arranging a certain proportion of utterances in a training corpus in variation sets facilitates word segmentation and phrase structure learning in miniature artificial languages by ...

Publication date: 1995